Duplicate detection - step 1: find the potential duplicates
This notebook runs the first part of the duplicate detection algorithm on a dataframe with the following columns:
- archiveType (used for duplicate detection algorithm)
- dataSetName
- datasetId
- geo_meanElev (used for duplicate detection algorithm)
- geo_meanLat (used for duplicate detection algorithm)
- geo_meanLon (used for duplicate detection algorithm)
- geo_siteName (used for duplicate detection algorithm)
- interpretation_direction
- interpretation_seasonality
- interpretation_variable
- interpretation_variableDetails
- originalDataURL
- originalDatabase
- paleoData_notes
- paleoData_proxy (used for duplicate detection algorithm)
- paleoData_units
- paleoData_values (used for duplicate detection algorithm, test for correlation, RMSE, correlation of 1st difference, RMSE of 1st difference)
- paleoData_variableName
- year (used for duplicate detection algorithm)
- yearUnits
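Before running the search it can help to confirm that the columns marked as used by the algorithm are present. The following is a minimal sketch, not part of dod2k_utilities: the names REQUIRED_FOR_DETECTION and missing_detection_columns are hypothetical, and the list is simply copied from the columns marked above.

```python
# Columns the text above marks as "used for duplicate detection algorithm".
# The constant and function names are illustrative, not part of dod2k_utilities.
REQUIRED_FOR_DETECTION = [
    "archiveType",
    "geo_meanElev",
    "geo_meanLat",
    "geo_meanLon",
    "geo_siteName",
    "paleoData_proxy",
    "paleoData_values",
    "year",
]


def missing_detection_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_FOR_DETECTION if name not in present]
```

With a pandas dataframe df, the check would be called as missing_detection_columns(df.columns); an empty list means all required columns are present.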
The key function for duplicate detection is find_duplicates in f_duplicate_search.py
The output is saved as CSV files in the directory data/DATABASENAME/dup_detection; these files are used again in step 2 (dup_decisions.py):
- pot_dup_correlations_DATABASENAME.csv: matrix of correlations between each pair
- pot_dup_distances_km_DATABASENAME.csv: matrix of distances between each pair
- pot_dup_IDs_DATABASENAME.csv: the IDs of each pair
- pot_dup_indices_DATABASENAME.csv: the dataframe indices of each pair
Summary figures of the potential duplicate pairs are also saved in the same directory, named following the pattern duplicatenumber_ID1_ID2_index1_index2.jpg
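As an illustration of the naming scheme described above, the sketch below builds the expected output paths for a given database name. The helper names detection_output_paths and figure_name are hypothetical, and details such as zero-padding of the duplicate number are assumptions; the package itself may build these names differently.

```python
from pathlib import Path


def detection_output_paths(db_name, data_root="data"):
    """Build the four CSV paths described above for one database name."""
    out_dir = Path(data_root) / db_name / "dup_detection"
    stems = ["correlations", "distances_km", "IDs", "indices"]
    return {stem: out_dir / f"pot_dup_{stem}_{db_name}.csv" for stem in stems}


def figure_name(number, id1, id2, index1, index2):
    """Assemble a figure file name following duplicatenumber_ID1_ID2_index1_index2.jpg."""
    # Assumption: the duplicate number is written without zero-padding.
    return f"{number}_{id1}_{id2}_{index1}_{index2}.jpg"
```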
Updates:
- 06/11/2025 by LL: Tidied up and updated for DoD2k v2.0
- 27/11/2024 by LL: Fixed a bug in find_duplicates (in f_duplicate_search) and relaxed site criteria.
- 27/9/2024 by LL: Created.
Author: Lucie J. Luecke
Set up working environment
Make sure that repo_root is set correctly. It should be your_root_dir/dod2k, and this should be the working directory throughout this notebook (and all other notebooks).
%load_ext autoreload
%autoreload 2
import sys
import os
from pathlib import Path
# Add parent directory to path (works from any notebook in notebooks/)
# the repo_root should be the parent directory of the notebooks folder
current_dir = Path().resolve()
# Determine repo root
if current_dir.name == 'dod2k':
    repo_root = current_dir
elif current_dir.parent.name == 'dod2k':
    repo_root = current_dir.parent
else:
    raise Exception('Please review the repo root structure (see first cell).')
# Update cwd and path only if needed
if os.getcwd() != str(repo_root):
    os.chdir(repo_root)
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
print(f"Repo root: {repo_root}")
if os.getcwd() == str(repo_root):
    print("Working directory matches repo root.")
Repo root: /home/jupyter-lluecke/dod2k
Working directory matches repo root.
import pandas as pd
import numpy as np
from dod2k_utilities import ut_functions as utf # contains utility functions
from dod2k_utilities import ut_duplicate_search as dup # contains duplicate search functions
Load dataset
Define the dataset which needs to be screened for duplicates. Input files for the duplicate detection mechanism need to be compact dataframes (pandas dataframes with standardised columns and entry formatting).
The function load_compact_dataframe_from_csv loads the dataframe from a csv file in data/DB/, where DB is the name of the database. The database name (db_name) can be one of
- pages2k
- ch2k
- iso2k
- sisal
- fe23
for the individual databases, or
- all_merged
to load the merged database of all individual databases. It can also be the name of any user-defined compact dataframe.
# load dataframe
db_name='all_merged'
# db_name = 'dup_test'
# db_name='ch2k'
df = utf.load_compact_dataframe_from_csv(db_name)
print(df.info())
df.name = db_name
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5320 entries, 0 to 5319
Data columns (total 21 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   archiveType                    5320 non-null   object
 1   dataSetName                    5320 non-null   object
 2   datasetId                      5320 non-null   object
 3   geo_meanElev                   5221 non-null   float32
 4   geo_meanLat                    5320 non-null   float32
 5   geo_meanLon                    5320 non-null   float32
 6   geo_siteName                   5320 non-null   object
 7   interpretation_direction       5320 non-null   object
 8   interpretation_seasonality     5320 non-null   object
 9   interpretation_variable        5320 non-null   object
 10  interpretation_variableDetail  5320 non-null   object
 11  originalDataURL                5320 non-null   object
 12  originalDatabase               5320 non-null   object
 13  paleoData_notes                5320 non-null   object
 14  paleoData_proxy                5320 non-null   object
 15  paleoData_sensorSpecies        5320 non-null   object
 16  paleoData_units                5320 non-null   object
 17  paleoData_values               5320 non-null   object
 18  paleoData_variableName         5320 non-null   object
 19  year                           5320 non-null   object
 20  yearUnits                      5320 non-null   object
dtypes: float32(3), object(18)
memory usage: 810.6+ KB
None
Duplicate Detection
Find duplicates
Now run the first part of the duplicate detection algorithm, which goes through each candidate pair and evaluates the pairs for the following criteria:
- metadata criteria:
  - archive types (archiveType) must be identical
  - proxy types (paleoData_proxy) must be identical
- geographical criteria:
  - elevation (geo_meanElev) must be similar, within a defined tolerance (use kwarg elevation_tolerance, defaults to 0)
  - latitude and longitude (geo_meanLat and geo_meanLon) must be similar, within a defined tolerance in km (use kwarg dist_tolerance_km, defaults to 8 km)
- overlap criterion:
  - time must overlap for at least $n$ points (use kwarg n_points_thresh to modify, defaults to $n=10$) unless at least one of the records is shorter than n_points_thresh
- site criterion:
  - there must be some overlap in the site name (geo_siteName)
- correlation criteria:
  - correlation over the overlapping period must be greater than a defined threshold (use corr_thresh to modify, defaults to 0.9) or correlation of the first difference must be greater than a defined threshold (use corr_diff_thresh to modify, defaults to 0.9)
  - RMSE over the overlapping period must be smaller than a defined threshold (use rmse_thresh to modify, defaults to 0.1) or RMSE of the first difference must be smaller than a defined threshold (use rmse_diff_thresh to modify, defaults to 0.1)
- URL criterion:
  - URLs (originalDataURL) must be identical if both records originate from the same database (originalDatabase must be identical)
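To make the geographical and correlation criteria more concrete, the sketch below shows one way the underlying quantities could be computed with NumPy. It is not the code in ut_duplicate_search: the function names haversine_km and pair_metrics are hypothetical, the haversine formula is an assumed choice of distance measure, and the metrics are computed on the values as given, with unique years per record assumed.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(h))


def pair_metrics(year_a, values_a, year_b, values_b):
    """Correlation and RMSE over the common years, for values and first differences.

    Assumes each record has unique years and at least three common years.
    """
    common, idx_a, idx_b = np.intersect1d(year_a, year_b, return_indices=True)
    a = np.asarray(values_a, dtype=float)[idx_a]
    b = np.asarray(values_b, dtype=float)[idx_b]
    da, db = np.diff(a), np.diff(b)  # first differences over the common years
    return {
        "n_overlap": int(common.size),
        "corr": float(np.corrcoef(a, b)[0, 1]),
        "rmse": float(np.sqrt(np.mean((a - b) ** 2))),
        "corr_diff": float(np.corrcoef(da, db)[0, 1]),
        "rmse_diff": float(np.sqrt(np.mean((da - db) ** 2))),
    }
```

A pair would then pass the distance check if haversine_km(...) is below dist_tolerance_km, and the values returned by pair_metrics can be compared against corr_thresh, corr_diff_thresh, rmse_thresh and rmse_diff_thresh.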
A candidate pair is flagged as a potential duplicate if all of these criteria are satisfied, OR if the correlation between the candidates is particularly high (>0.98) and there is sufficient overlap (as defined by the overlap criterion).
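The flagging rule can be written as a small piece of boolean logic. This is a sketch under stated assumptions rather than the package's implementation: the function names are hypothetical, the two correlation-criteria bullets are assumed to be combined with AND, the overlap test uses "at least n_points_thresh points", and the exception for records shorter than n_points_thresh is ignored.

```python
def passes_correlation_criteria(m, corr_thresh=0.9, corr_diff_thresh=0.9,
                                rmse_thresh=0.1, rmse_diff_thresh=0.1):
    """Apply the correlation and RMSE thresholds to a dict of pair metrics."""
    corr_ok = m["corr"] > corr_thresh or m["corr_diff"] > corr_diff_thresh
    rmse_ok = m["rmse"] < rmse_thresh or m["rmse_diff"] < rmse_diff_thresh
    # Assumption: both bullets of the correlation criteria must hold.
    return corr_ok and rmse_ok


def flag_pair(other_criteria_ok, m, n_points_thresh=10, high_corr=0.98):
    """Flag a pair if all criteria hold, or if correlation is very high with enough overlap.

    other_criteria_ok summarises the metadata, geographical, site and URL checks.
    """
    overlap_ok = m["n_overlap"] >= n_points_thresh
    all_ok = other_criteria_ok and overlap_ok and passes_correlation_criteria(m)
    return all_ok or (overlap_ok and m["corr"] > high_corr)
```

For example, a pair with 50 overlapping points and a correlation of 0.99 is flagged even when the site names do not match, whereas the same correlation with only 5 overlapping points is not.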
The output for a database named DB is saved under data/DB/dup_detection/dup_detection_candidates_DB.csv.
## run the find duplicate algorithm
dup.find_duplicates_optimized(df, n_points_thresh=10)
all_merged
Start duplicate search:
=================================
checking parameters:
proxy archive : must match
proxy type : must match
distance (km) < 8
elevation : must match
time overlap > 10
correlation > 0.9
RMSE < 0.1
1st difference rmse < 0.1
correlation of 1st difference > 0.9
=================================
Start duplicate search
Progress: 0/5320
--> Found potential duplicate: 0: pages2k_0&4408: iso2k_296 (n_potential_duplicates=1)
--> Found potential duplicate: 0: pages2k_0&4409: iso2k_298 (n_potential_duplicates=2)
--> Found potential duplicate: 0: pages2k_0&4410: iso2k_299 (n_potential_duplicates=3)
--> Found potential duplicate: 2: pages2k_6&3037: fe23_northamerica_usa_az555 (n_potential_duplicates=4)
Progress: 10/5320
--> Found potential duplicate: 17: pages2k_50&1587: fe23_northamerica_canada_cana091 (n_potential_duplicates=5)
Progress: 20/5320
--> Found potential duplicate: 20: pages2k_62&21: pages2k_63 (n_potential_duplicates=6)
--> Found potential duplicate: 29: pages2k_81&4146: ch2k_he08lra01_76 (n_potential_duplicates=7)
--> Found potential duplicate: 29: pages2k_81&4736: iso2k_1813 (n_potential_duplicates=8)
Progress: 30/5320
--> Found potential duplicate: 30: pages2k_83&4767: iso2k_1916 (n_potential_duplicates=9)
--> Found potential duplicate: 31: pages2k_85&32: pages2k_88 (n_potential_duplicates=10)
--> Found potential duplicate: 34: pages2k_94&1639: fe23_northamerica_canada_cana153 (n_potential_duplicates=11)
--> Found potential duplicate: 38: pages2k_107&2809: fe23_northamerica_usa_ak046 (n_potential_duplicates=12)
Progress: 40/5320
--> Found potential duplicate: 44: pages2k_121&45: pages2k_122 (n_potential_duplicates=13)
Progress: 50/5320
--> Found potential duplicate: 50: pages2k_132&1706: fe23_northamerica_canada_cana225 (n_potential_duplicates=14)
--> Found potential duplicate: 58: pages2k_158&4018: fe23_northamerica_usa_wa069 (n_potential_duplicates=15)
Progress: 60/5320
--> Found potential duplicate: 62: pages2k_171&4102: fe23_northamerica_usa_wy021 (n_potential_duplicates=16)
Progress: 70/5320
--> Found potential duplicate: 72: pages2k_203&4526: iso2k_826 (n_potential_duplicates=17)
Progress: 80/5320
--> Found potential duplicate: 82: pages2k_225&3699: fe23_northamerica_usa_nv512 (n_potential_duplicates=18)
--> Found potential duplicate: 86: pages2k_238&4576: iso2k_1044 (n_potential_duplicates=19)
--> Found potential duplicate: 88: pages2k_242&4328: ch2k_li06fij01_582 (n_potential_duplicates=20)
--> Found potential duplicate: 88: pages2k_242&4423: iso2k_353 (n_potential_duplicates=21)
Progress: 90/5320
--> Found potential duplicate: 94: pages2k_258&4660: iso2k_1498 (n_potential_duplicates=22)
--> Found potential duplicate: 97: pages2k_263&4629: iso2k_1322 (n_potential_duplicates=23)
--> Found potential duplicate: 99: pages2k_267&4352: iso2k_58 (n_potential_duplicates=24)
--> Found potential duplicate: 99: pages2k_267&4581: iso2k_1068 (n_potential_duplicates=25)
Progress: 100/5320
--> Found potential duplicate: 101: pages2k_271&4306: ch2k_fe18rus01_492 (n_potential_duplicates=26)
--> Found potential duplicate: 101: pages2k_271&4753: iso2k_1861 (n_potential_duplicates=27)
--> Found potential duplicate: 102: pages2k_273&2669: fe23_asia_russ130w (n_potential_duplicates=28)
--> Found potential duplicate: 105: pages2k_281&1641: fe23_northamerica_canada_cana155 (n_potential_duplicates=29)
--> Found potential duplicate: 109: pages2k_294&2784: fe23_northamerica_usa_ak021 (n_potential_duplicates=30)
Progress: 110/5320
--> Found potential duplicate: 113: pages2k_305&115: pages2k_309 (n_potential_duplicates=31)
--> Found potential duplicate: 114: pages2k_307&116: pages2k_311 (n_potential_duplicates=32)
--> Found potential duplicate: 118: pages2k_315&4425: iso2k_362 (n_potential_duplicates=33)
--> Found potential duplicate: 119: pages2k_317&4148: ch2k_na09mal01_84 (n_potential_duplicates=34)
--> Found potential duplicate: 119: pages2k_317&4722: iso2k_1754 (n_potential_duplicates=35)
Progress: 120/5320
--> Found potential duplicate: 121: pages2k_323&1691: fe23_northamerica_canada_cana210 (n_potential_duplicates=36)
Progress: 130/5320
Progress: 140/5320
--> Found potential duplicate: 142: pages2k_385&4236: ch2k_fe09oga01_304 (n_potential_duplicates=37)
--> Found potential duplicate: 142: pages2k_385&4769: iso2k_1922 (n_potential_duplicates=38)
--> Found potential duplicate: 143: pages2k_387&4237: ch2k_fe09oga01_306 (n_potential_duplicates=39)
--> Found potential duplicate: 148: pages2k_395&4271: ch2k_ca07fli01_400 (n_potential_duplicates=40)
--> Found potential duplicate: 148: pages2k_395&4579: iso2k_1057 (n_potential_duplicates=41)
--> Found potential duplicate: 149: pages2k_397&4272: ch2k_ca07fli01_402 (n_potential_duplicates=42)
Progress: 150/5320
--> Found potential duplicate: 154: pages2k_409&4280: ch2k_qu96esv01_422 (n_potential_duplicates=43)
--> Found potential duplicate: 154: pages2k_409&4386: iso2k_218 (n_potential_duplicates=44)
--> Found potential duplicate: 156: pages2k_414&158: pages2k_418 (n_potential_duplicates=45)
--> Found potential duplicate: 157: pages2k_417&159: pages2k_421 (n_potential_duplicates=46)
Progress: 160/5320
--> Found potential duplicate: 165: pages2k_427&171: pages2k_433 (n_potential_duplicates=47)
Progress: 170/5320
--> Found potential duplicate: 173: pages2k_435&325: pages2k_842 (n_potential_duplicates=48)
--> Found potential duplicate: 176: pages2k_444&177: pages2k_445 (n_potential_duplicates=49)
--> Found potential duplicate: 176: pages2k_444&178: pages2k_446 (n_potential_duplicates=50)
--> Found potential duplicate: 177: pages2k_445&178: pages2k_446 (n_potential_duplicates=51)
Progress: 180/5320
--> Found potential duplicate: 184: pages2k_462&4208: ch2k_os14ucp01_236 (n_potential_duplicates=52)
--> Found potential duplicate: 184: pages2k_462&4422: iso2k_350 (n_potential_duplicates=53)
--> Found potential duplicate: 187: pages2k_468&1310: pages2k_3550 (n_potential_duplicates=54)
--> Found potential duplicate: 187: pages2k_468&2676: fe23_asia_russ137w (n_potential_duplicates=55)
--> Found potential duplicate: 189: pages2k_472&190: pages2k_474 (n_potential_duplicates=56)
--> Found potential duplicate: 189: pages2k_472&192: pages2k_477 (n_potential_duplicates=57)
Progress: 190/5320
--> Found potential duplicate: 190: pages2k_474&192: pages2k_477 (n_potential_duplicates=58)
--> Found potential duplicate: 193: pages2k_478&4744: iso2k_1846 (n_potential_duplicates=59)
--> Found potential duplicate: 196: pages2k_486&3157: fe23_northamerica_usa_ca609 (n_potential_duplicates=60)
--> Found potential duplicate: 199: pages2k_495&4123: ch2k_li06rar01_12 (n_potential_duplicates=61)
--> Found potential duplicate: 199: pages2k_495&4662: iso2k_1502 (n_potential_duplicates=62)
Progress: 200/5320
--> Found potential duplicate: 202: pages2k_500&4235: ch2k_as05gua01_302 (n_potential_duplicates=63)
--> Found potential duplicate: 202: pages2k_500&4675: iso2k_1559 (n_potential_duplicates=64)
Progress: 210/5320
--> Found potential duplicate: 216: pages2k_541&4431: iso2k_404 (n_potential_duplicates=65)
--> Found potential duplicate: 217: pages2k_543&373: pages2k_976 (n_potential_duplicates=66)
Progress: 220/5320
--> Found potential duplicate: 224: pages2k_565&4568: iso2k_998 (n_potential_duplicates=67)
Progress: 230/5320
--> Found potential duplicate: 233: pages2k_583&3550: fe23_northamerica_usa_mt116 (n_potential_duplicates=68)
--> Found potential duplicate: 236: pages2k_592&4221: ch2k_li06rar02_270 (n_potential_duplicates=69)
--> Found potential duplicate: 236: pages2k_592&4661: iso2k_1500 (n_potential_duplicates=70)
Progress: 240/5320
--> Found potential duplicate: 243: pages2k_610&4601: iso2k_1199 (n_potential_duplicates=71)
Progress: 250/5320
--> Found potential duplicate: 250: pages2k_626&4020: fe23_northamerica_usa_wa071 (n_potential_duplicates=72)
Progress: 260/5320
Progress: 270/5320
--> Found potential duplicate: 272: pages2k_691&1558: fe23_northamerica_canada_cana062 (n_potential_duplicates=73)
Progress: 280/5320
--> Found potential duplicate: 285: pages2k_730&4429: iso2k_396 (n_potential_duplicates=74)
--> Found potential duplicate: 287: pages2k_736&4105: fe23_northamerica_usa_wy024 (n_potential_duplicates=75)
Progress: 290/5320
Progress: 300/5320
--> Found potential duplicate: 307: pages2k_800&1715: fe23_northamerica_canada_cana234 (n_potential_duplicates=76)
Progress: 310/5320
--> Found potential duplicate: 312: pages2k_818&4451: iso2k_488 (n_potential_duplicates=77)
--> Found potential duplicate: 317: pages2k_827&319: pages2k_830 (n_potential_duplicates=78)
Progress: 320/5320
--> Found potential duplicate: 320: pages2k_831&813: pages2k_2220 (n_potential_duplicates=79)
--> Found potential duplicate: 320: pages2k_831&2666: fe23_asia_russ127w (n_potential_duplicates=80)
Progress: 330/5320
--> Found potential duplicate: 331: pages2k_857&3909: fe23_northamerica_usa_ut511 (n_potential_duplicates=81)
--> Found potential duplicate: 339: pages2k_881&4570: iso2k_1010 (n_potential_duplicates=82)
Progress: 340/5320
--> Found potential duplicate: 342: pages2k_893&343: pages2k_895 (n_potential_duplicates=83)
--> Found potential duplicate: 342: pages2k_893&345: pages2k_900 (n_potential_duplicates=84)
--> Found potential duplicate: 343: pages2k_895&345: pages2k_900 (n_potential_duplicates=85)
Progress: 350/5320
--> Found potential duplicate: 358: pages2k_940&4219: ch2k_dr99abr01_264 (n_potential_duplicates=86)
--> Found potential duplicate: 358: pages2k_940&4220: ch2k_dr99abr01_266 (n_potential_duplicates=87)
--> Found potential duplicate: 358: pages2k_940&4361: iso2k_91 (n_potential_duplicates=88)
Progress: 360/5320
--> Found potential duplicate: 361: pages2k_945&4364: iso2k_100 (n_potential_duplicates=89)
--> Found potential duplicate: 366: pages2k_960&4494: iso2k_641 (n_potential_duplicates=90)
Progress: 370/5320
--> Found potential duplicate: 375: pages2k_982&3767: fe23_northamerica_usa_or042 (n_potential_duplicates=91)
Progress: 380/5320
--> Found potential duplicate: 382: pages2k_1004&4495: iso2k_644 (n_potential_duplicates=92)
--> Found potential duplicate: 389: pages2k_1026&3035: fe23_northamerica_usa_az553 (n_potential_duplicates=93)
Progress: 390/5320
--> Found potential duplicate: 396: pages2k_1048&4605: iso2k_1212 (n_potential_duplicates=94)
Progress: 400/5320
--> Found potential duplicate: 407: pages2k_1089&3547: fe23_northamerica_usa_mt112 (n_potential_duplicates=95)
--> Found potential duplicate: 407: pages2k_1089&3548: fe23_northamerica_usa_mt113 (n_potential_duplicates=96)
Progress: 410/5320
--> Found potential duplicate: 415: pages2k_1108&4580: iso2k_1060 (n_potential_duplicates=97)
--> Found potential duplicate: 418: pages2k_1116&1651: fe23_northamerica_canada_cana170w (n_potential_duplicates=98)
Progress: 420/5320
--> Found potential duplicate: 428: pages2k_1147&4147: ch2k_da06maf01_78 (n_potential_duplicates=99)
--> Found potential duplicate: 428: pages2k_1147&4153: ch2k_da06maf02_104 (n_potential_duplicates=100)
--> Found potential duplicate: 428: pages2k_1147&4719: iso2k_1748 (n_potential_duplicates=101)
Progress: 430/5320
--> Found potential duplicate: 431: pages2k_1153&432: pages2k_1156 (n_potential_duplicates=102)
--> Found potential duplicate: 431: pages2k_1153&434: pages2k_1160 (n_potential_duplicates=103)
--> Found potential duplicate: 432: pages2k_1156&434: pages2k_1160 (n_potential_duplicates=104)
Progress: 440/5320
Progress: 450/5320
--> Found potential duplicate: 453: pages2k_1209&3275: fe23_northamerica_usa_co553 (n_potential_duplicates=105)
Progress: 460/5320
--> Found potential duplicate: 467: pages2k_1252&1592: fe23_northamerica_canada_cana096 (n_potential_duplicates=106)
Progress: 470/5320
--> Found potential duplicate: 474: pages2k_1274&4682: iso2k_1577 (n_potential_duplicates=107)
Progress: 480/5320
--> Found potential duplicate: 481: pages2k_1293&4524: iso2k_821 (n_potential_duplicates=108)
Progress: 490/5320
--> Found potential duplicate: 491: pages2k_1325&4111: fe23_northamerica_usa_wy030 (n_potential_duplicates=109)
Progress: 500/5320
--> Found potential duplicate: 502: pages2k_1360&4126: ch2k_ur00mai01_22 (n_potential_duplicates=110)
--> Found potential duplicate: 502: pages2k_1360&4362: iso2k_94 (n_potential_duplicates=111)
--> Found potential duplicate: 502: pages2k_1360&4363: iso2k_98 (n_potential_duplicates=112)
--> Found potential duplicate: 503: pages2k_1362&504: pages2k_1365 (n_potential_duplicates=113)
--> Found potential duplicate: 505: pages2k_1370&4689: iso2k_1619 (n_potential_duplicates=114)
Progress: 510/5320
Progress: 520/5320
--> Found potential duplicate: 520: pages2k_1420&1608: fe23_northamerica_canada_cana111 (n_potential_duplicates=115)
--> Found potential duplicate: 527: pages2k_1442&528: pages2k_1444 (n_potential_duplicates=116)
Progress: 530/5320
Progress: 540/5320
--> Found potential duplicate: 542: pages2k_1488&595: pages2k_1628 (n_potential_duplicates=117)
--> Found potential duplicate: 542: pages2k_1488&4138: ch2k_nu11pal01_52 (n_potential_duplicates=118)
--> Found potential duplicate: 542: pages2k_1488&4456: iso2k_505 (n_potential_duplicates=119)
--> Found potential duplicate: 542: pages2k_1488&4482: iso2k_579 (n_potential_duplicates=120)
--> Found potential duplicate: 543: pages2k_1490&4139: ch2k_nu11pal01_54 (n_potential_duplicates=121)
--> Found potential duplicate: 544: pages2k_1491&4481: iso2k_575 (n_potential_duplicates=122)
--> Found potential duplicate: 547: pages2k_1497&4761: iso2k_1885 (n_potential_duplicates=123)
Progress: 550/5320
--> Found potential duplicate: 550: pages2k_1515&552: pages2k_1519 (n_potential_duplicates=124)
--> Found potential duplicate: 553: pages2k_1520&554: pages2k_1522 (n_potential_duplicates=125)
Progress: 560/5320
--> Found potential duplicate: 564: pages2k_1547&4396: iso2k_259 (n_potential_duplicates=126)
Progress: 570/5320
--> Found potential duplicate: 573: pages2k_1566&1712: fe23_northamerica_canada_cana231 (n_potential_duplicates=127)
Progress: 580/5320
--> Found potential duplicate: 585: pages2k_1605&3154: fe23_northamerica_usa_ca606 (n_potential_duplicates=128)
Progress: 590/5320
--> Found potential duplicate: 590: pages2k_1619&592: pages2k_1623 (n_potential_duplicates=129)
--> Found potential duplicate: 595: pages2k_1628&4138: ch2k_nu11pal01_52 (n_potential_duplicates=130)
--> Found potential duplicate: 595: pages2k_1628&4456: iso2k_505 (n_potential_duplicates=131)
--> Found potential duplicate: 595: pages2k_1628&4482: iso2k_579 (n_potential_duplicates=132)
--> Found potential duplicate: 597: pages2k_1636&4030: fe23_northamerica_usa_wa081 (n_potential_duplicates=133)
Progress: 600/5320
Progress: 610/5320
--> Found potential duplicate: 614: pages2k_1686&615: pages2k_1688 (n_potential_duplicates=134)
--> Found potential duplicate: 617: pages2k_1692&2421: fe23_asia_mong012 (n_potential_duplicates=135)
--> Found potential duplicate: 619: pages2k_1703&4204: ch2k_mo06ped01_226 (n_potential_duplicates=136)
--> Found potential duplicate: 619: pages2k_1703&4490: iso2k_629 (n_potential_duplicates=137)
Progress: 620/5320
--> Found potential duplicate: 624: pages2k_1712&4504: iso2k_715 (n_potential_duplicates=138)
--> Found potential duplicate: 628: pages2k_1720&4683: iso2k_1579 (n_potential_duplicates=139)
Progress: 630/5320
--> Found potential duplicate: 635: pages2k_1741&4053: fe23_northamerica_usa_wa104 (n_potential_duplicates=140)
--> Found potential duplicate: 638: pages2k_1750&4752: iso2k_1856 (n_potential_duplicates=141)
--> Found potential duplicate: 638: pages2k_1750&4968: sisal_294.0_194 (n_potential_duplicates=142)
Progress: 640/5320
--> Found potential duplicate: 644: pages2k_1771&4190: ch2k_tu01lai01_192 (n_potential_duplicates=143)
Progress: 650/5320
--> Found potential duplicate: 656: pages2k_1804&3441: fe23_northamerica_usa_me010 (n_potential_duplicates=144)
Progress: 660/5320
Progress: 670/5320
--> Found potential duplicate: 673: pages2k_1859&4212: ch2k_he10gua01_244 (n_potential_duplicates=145)
--> Found potential duplicate: 673: pages2k_1859&4715: iso2k_1735 (n_potential_duplicates=146)
--> Found potential duplicate: 674: pages2k_1861&4213: ch2k_he10gua01_246 (n_potential_duplicates=147)
Progress: 680/5320
--> Found potential duplicate: 680: pages2k_1880&2823: fe23_northamerica_usa_ak060 (n_potential_duplicates=148)
--> Found potential duplicate: 684: pages2k_1891&685: pages2k_1893 (n_potential_duplicates=149)
Progress: 690/5320
--> Found potential duplicate: 695: pages2k_1918&4365: iso2k_102 (n_potential_duplicates=150)
--> Found potential duplicate: 696: pages2k_1920&697: pages2k_1923 (n_potential_duplicates=151)
Progress: 700/5320
--> Found potential duplicate: 700: pages2k_1932&701: pages2k_1934 (n_potential_duplicates=152)
--> Found potential duplicate: 705: pages2k_1942&4128: ch2k_zi04ifr01_26 (n_potential_duplicates=153)
--> Found potential duplicate: 705: pages2k_1942&4395: iso2k_257 (n_potential_duplicates=154)
Progress: 710/5320
--> Found potential duplicate: 716: pages2k_1972&717: pages2k_1973 (n_potential_duplicates=155)
--> Found potential duplicate: 718: pages2k_1976&720: pages2k_1980 (n_potential_duplicates=156)
--> Found potential duplicate: 719: pages2k_1978&721: pages2k_1983 (n_potential_duplicates=157)
Progress: 720/5320
--> Found potential duplicate: 722: pages2k_1985&4622: iso2k_1294 (n_potential_duplicates=158)
--> Found potential duplicate: 724: pages2k_1989&725: pages2k_1991 (n_potential_duplicates=159)
--> Found potential duplicate: 726: pages2k_1994&4216: ch2k_de12anc01_258 (n_potential_duplicates=160)
Progress: 730/5320
--> Found potential duplicate: 732: pages2k_2013&1593: fe23_northamerica_canada_cana097 (n_potential_duplicates=161)
Progress: 740/5320
--> Found potential duplicate: 742: pages2k_2042&4127: ch2k_tu95mad01_24 (n_potential_duplicates=162)
--> Found potential duplicate: 742: pages2k_2042&4342: iso2k_20 (n_potential_duplicates=163)
Progress: 750/5320
--> Found potential duplicate: 750: pages2k_2059&2821: fe23_northamerica_usa_ak058 (n_potential_duplicates=164)
--> Found potential duplicate: 758: pages2k_2085&1517: fe23_northamerica_canada_cana002 (n_potential_duplicates=165)
Progress: 760/5320
--> Found potential duplicate: 761: pages2k_2094&4290: ch2k_tu01dep01_450 (n_potential_duplicates=166)
--> Found potential duplicate: 761: pages2k_2094&4602: iso2k_1201 (n_potential_duplicates=167)
--> Found potential duplicate: 763: pages2k_2098&765: pages2k_2103 (n_potential_duplicates=168)
--> Found potential duplicate: 768: pages2k_2110&3276: fe23_northamerica_usa_co554 (n_potential_duplicates=169)
Progress: 770/5320
Progress: 780/5320
--> Found potential duplicate: 782: pages2k_2146&784: pages2k_2149 (n_potential_duplicates=170)
--> Found potential duplicate: 782: pages2k_2146&785: pages2k_2150 (n_potential_duplicates=171)
--> Found potential duplicate: 784: pages2k_2149&785: pages2k_2150 (n_potential_duplicates=172)
--> Found potential duplicate: 788: pages2k_2156&1650: fe23_northamerica_canada_cana169w (n_potential_duplicates=173)
Progress: 790/5320
Progress: 800/5320
--> Found potential duplicate: 808: pages2k_2214&4692: iso2k_1631 (n_potential_duplicates=174)
Progress: 810/5320
--> Found potential duplicate: 813: pages2k_2220&2666: fe23_asia_russ127w (n_potential_duplicates=175)
--> Found potential duplicate: 816: pages2k_2226&2416: fe23_asia_mong007w (n_potential_duplicates=176)
Progress: 820/5320
Progress: 830/5320
--> Found potential duplicate: 833: pages2k_2265&2833: fe23_northamerica_usa_ak070 (n_potential_duplicates=177)
Progress: 840/5320
--> Found potential duplicate: 842: pages2k_2287&843: pages2k_2290 (n_potential_duplicates=178)
--> Found potential duplicate: 848: pages2k_2300&4183: ch2k_os14rip01_174 (n_potential_duplicates=179)
Progress: 850/5320
--> Found potential duplicate: 850: pages2k_2303&2415: fe23_asia_mong006 (n_potential_duplicates=180)
--> Found potential duplicate: 853: pages2k_2309&4197: ch2k_we09arr01_208 (n_potential_duplicates=181)
--> Found potential duplicate: 854: pages2k_2311&4198: ch2k_we09arr01_210 (n_potential_duplicates=182)
--> Found potential duplicate: 858: pages2k_2319&2877: fe23_northamerica_usa_ak6 (n_potential_duplicates=183)
Progress: 860/5320
--> Found potential duplicate: 862: pages2k_2339&864: pages2k_2344 (n_potential_duplicates=184)
Progress: 870/5320
--> Found potential duplicate: 870: pages2k_2361&4046: fe23_northamerica_usa_wa097 (n_potential_duplicates=185)
Progress: 880/5320
--> Found potential duplicate: 883: pages2k_2402&3306: fe23_northamerica_usa_co586 (n_potential_duplicates=186)
Progress: 890/5320
--> Found potential duplicate: 892: pages2k_2430&1610: fe23_northamerica_canada_cana113 (n_potential_duplicates=187)
Progress: 900/5320
--> Found potential duplicate: 906: pages2k_2473&4103: fe23_northamerica_usa_wy022 (n_potential_duplicates=188)
Progress: 910/5320
--> Found potential duplicate: 915: pages2k_2500&916: pages2k_2502 (n_potential_duplicates=189)
Progress: 920/5320
--> Found potential duplicate: 920: pages2k_2510&4690: iso2k_1626 (n_potential_duplicates=190)
--> Found potential duplicate: 922: pages2k_2514&4652: iso2k_1467 (n_potential_duplicates=191)
--> Found potential duplicate: 924: pages2k_2517&4593: iso2k_1130 (n_potential_duplicates=192)
Progress: 930/5320
--> Found potential duplicate: 930: pages2k_2534&4681: iso2k_1575 (n_potential_duplicates=193)
--> Found potential duplicate: 932: pages2k_2538&4754: iso2k_1862 (n_potential_duplicates=194)
Progress: 940/5320
--> Found potential duplicate: 940: pages2k_2561&1590: fe23_northamerica_canada_cana094 (n_potential_duplicates=195)
--> Found potential duplicate: 948: pages2k_2592&950: pages2k_2596 (n_potential_duplicates=196)
--> Found potential duplicate: 949: pages2k_2595&951: pages2k_2599 (n_potential_duplicates=197)
Progress: 950/5320
--> Found potential duplicate: 954: pages2k_2604&955: pages2k_2606 (n_potential_duplicates=198)
--> Found potential duplicate: 954: pages2k_2604&4657: iso2k_1481 (n_potential_duplicates=199)
--> Found potential duplicate: 955: pages2k_2606&4657: iso2k_1481 (n_potential_duplicates=200)
--> Found potential duplicate: 956: pages2k_2607&957: pages2k_2609 (n_potential_duplicates=201)
--> Found potential duplicate: 956: pages2k_2607&959: pages2k_2612 (n_potential_duplicates=202)
--> Found potential duplicate: 957: pages2k_2609&959: pages2k_2612 (n_potential_duplicates=203)
Progress: 960/5320
--> Found potential duplicate: 960: pages2k_2613&4653: iso2k_1470 (n_potential_duplicates=204)
--> Found potential duplicate: 962: pages2k_2617&4680: iso2k_1573 (n_potential_duplicates=205)
Progress: 970/5320
--> Found potential duplicate: 970: pages2k_2634&3402: fe23_northamerica_usa_id013 (n_potential_duplicates=206)
--> Found potential duplicate: 978: pages2k_2660&2777: fe23_northamerica_usa_ak014 (n_potential_duplicates=207)
Progress: 980/5320
--> Found potential duplicate: 984: pages2k_2677&4104: fe23_northamerica_usa_wy023 (n_potential_duplicates=208)
Progress: 990/5320
--> Found potential duplicate: 992: pages2k_2703&2856: fe23_northamerica_usa_ak094 (n_potential_duplicates=209)
--> Found potential duplicate: 999: pages2k_2722&1719: fe23_northamerica_canada_cana238 (n_potential_duplicates=210)
Progress: 1000/5320
--> Found potential duplicate: 1009: pages2k_2750&4707: iso2k_1708 (n_potential_duplicates=211)
Progress: 1010/5320
--> Found potential duplicate: 1010: pages2k_2752&1011: pages2k_2755 (n_potential_duplicates=212)
--> Found potential duplicate: 1010: pages2k_2752&1013: pages2k_2759 (n_potential_duplicates=213)
--> Found potential duplicate: 1011: pages2k_2755&1013: pages2k_2759 (n_potential_duplicates=214)
Progress: 1020/5320
--> Found potential duplicate: 1029: pages2k_2793&1030: pages2k_2795 (n_potential_duplicates=215)
Progress: 1030/5320
--> Found potential duplicate: 1030: pages2k_2795&1032: pages2k_2798 (n_potential_duplicates=216)
--> Found potential duplicate: 1031: pages2k_2796&1032: pages2k_2798 (n_potential_duplicates=217)
Progress: 1040/5320
--> Found potential duplicate: 1043: pages2k_2830&2386: fe23_northamerica_mexico_mexi020 (n_potential_duplicates=218)
--> Found potential duplicate: 1047: pages2k_2843&4032: fe23_northamerica_usa_wa083 (n_potential_duplicates=219)
Progress: 1050/5320
Progress: 1060/5320
--> Found potential duplicate: 1066: pages2k_2899&1067: pages2k_2901 (n_potential_duplicates=220)
--> Found potential duplicate: 1068: pages2k_2904&1069: pages2k_2906 (n_potential_duplicates=221)
Progress: 1070/5320
--> Found potential duplicate: 1075: pages2k_2922&3151: fe23_northamerica_usa_ca603 (n_potential_duplicates=222)
Progress: 1080/5320
--> Found potential duplicate: 1086: pages2k_2953&4480: iso2k_573 (n_potential_duplicates=223)
--> Found potential duplicate: 1088: pages2k_2959&2406: fe23_northamerica_mexico_mexi043 (n_potential_duplicates=224)
Progress: 1090/5320
--> Found potential duplicate: 1094: pages2k_2976&3397: fe23_northamerica_usa_id008 (n_potential_duplicates=225)
Progress: 1100/5320
--> Found potential duplicate: 1102: pages2k_3002&3768: fe23_northamerica_usa_or043 (n_potential_duplicates=226)
Progress: 1110/5320
--> Found potential duplicate: 1111: pages2k_3028&1112: pages2k_3030 (n_potential_duplicates=227)
--> Found potential duplicate: 1111: pages2k_3028&1114: pages2k_3033 (n_potential_duplicates=228)
--> Found potential duplicate: 1112: pages2k_3030&1114: pages2k_3033 (n_potential_duplicates=229)
--> Found potential duplicate: 1116: pages2k_3038&3543: fe23_northamerica_usa_mt108 (n_potential_duplicates=230)
Progress: 1120/5320
--> Found potential duplicate: 1125: pages2k_3064&4500: iso2k_698 (n_potential_duplicates=231)
--> Found potential duplicate: 1126: pages2k_3068&4316: ch2k_zi14ifr02_522 (n_potential_duplicates=232)
--> Found potential duplicate: 1126: pages2k_3068&4317: ch2k_zi14ifr02_524 (n_potential_duplicates=233)
Progress: 1130/5320
--> Found potential duplicate: 1132: pages2k_3085&4173: ch2k_ku00nin01_150 (n_potential_duplicates=234)
--> Found potential duplicate: 1132: pages2k_3085&4672: iso2k_1554 (n_potential_duplicates=235)
--> Found potential duplicate: 1132: pages2k_3085&4673: iso2k_1556 (n_potential_duplicates=236)
Progress: 1140/5320
--> Found potential duplicate: 1140: pages2k_3107&3274: fe23_northamerica_usa_co552 (n_potential_duplicates=237)
--> Found potential duplicate: 1141: pages2k_3108&3274: fe23_northamerica_usa_co552 (n_potential_duplicates=238)
--> Found potential duplicate: 1149: pages2k_3132&4170: ch2k_qu06rab01_144 (n_potential_duplicates=239)
--> Found potential duplicate: 1149: pages2k_3132&4626: iso2k_1311 (n_potential_duplicates=240)
Progress: 1150/5320
--> Found potential duplicate: 1150: pages2k_3134&4171: ch2k_qu06rab01_146 (n_potential_duplicates=241)
Progress: 1160/5320
--> Found potential duplicate: 1164: pages2k_3170&2524: fe23_australia_newz062 (n_potential_duplicates=242)
--> Found potential duplicate: 1167: pages2k_3179&2820: fe23_northamerica_usa_ak057 (n_potential_duplicates=243)
Progress: 1170/5320
--> Found potential duplicate: 1170: pages2k_3188&1171: pages2k_3191 (n_potential_duplicates=244)
--> Found potential duplicate: 1172: pages2k_3196&2420: fe23_asia_mong011 (n_potential_duplicates=245)
--> Found potential duplicate: 1175: pages2k_3202&4712: iso2k_1727 (n_potential_duplicates=246)
Progress: 1180/5320
--> Found potential duplicate: 1185: pages2k_3234&1186: pages2k_3236 (n_potential_duplicates=247)
--> Found potential duplicate: 1185: pages2k_3234&1188: pages2k_3239 (n_potential_duplicates=248)
--> Found potential duplicate: 1186: pages2k_3236&1188: pages2k_3239 (n_potential_duplicates=249)
Progress: 1190/5320
--> Found potential duplicate: 1191: pages2k_3243&4339: iso2k_0 (n_potential_duplicates=250)
--> Found potential duplicate: 1198: pages2k_3263&4613: iso2k_1264 (n_potential_duplicates=251)
Progress: 1200/5320
--> Found potential duplicate: 1200: pages2k_3266&4269: ch2k_go12sbv01_396 (n_potential_duplicates=252)
--> Found potential duplicate: 1200: pages2k_3266&4538: iso2k_870 (n_potential_duplicates=253)
Progress: 1210/5320
--> Found potential duplicate: 1217: pages2k_3307&4417: iso2k_339 (n_potential_duplicates=254)
--> Found potential duplicate: 1219: pages2k_3313&3109: fe23_northamerica_usa_ca560 (n_potential_duplicates=255)
Progress: 1220/5320
--> Found potential duplicate: 1227: pages2k_3337&1229: pages2k_3342 (n_potential_duplicates=256)
Progress: 1230/5320
--> Found potential duplicate: 1233: pages2k_3352&4301: ch2k_zi14tur01_480 (n_potential_duplicates=257)
--> Found potential duplicate: 1233: pages2k_3352&4302: ch2k_zi14tur01_482 (n_potential_duplicates=258)
--> Found potential duplicate: 1233: pages2k_3352&4412: iso2k_302 (n_potential_duplicates=259)
Progress: 1240/5320
--> Found potential duplicate: 1243: pages2k_3372&4260: ch2k_ki04mcv01_366 (n_potential_duplicates=260)
--> Found potential duplicate: 1243: pages2k_3372&4376: iso2k_155 (n_potential_duplicates=261)
--> Found potential duplicate: 1244: pages2k_3374&4261: ch2k_ki04mcv01_368 (n_potential_duplicates=262)
Progress: 1250/5320
--> Found potential duplicate: 1256: pages2k_3404&1528: fe23_northamerica_canada_cana029 (n_potential_duplicates=263)
Progress: 1260/5320
--> Found potential duplicate: 1261: pages2k_3417&1262: pages2k_3419 (n_potential_duplicates=264)
Progress: 1270/5320
Progress: 1280/5320
Progress: 1290/5320
--> Found potential duplicate: 1293: pages2k_3503&4021: fe23_northamerica_usa_wa072 (n_potential_duplicates=265)
Progress: 1300/5320
--> Found potential duplicate: 1301: pages2k_3524&2773: fe23_northamerica_usa_ak010 (n_potential_duplicates=266)
Progress: 1310/5320
--> Found potential duplicate: 1310: pages2k_3550&2676: fe23_asia_russ137w (n_potential_duplicates=267)
--> Found potential duplicate: 1311: pages2k_3552&4684: iso2k_1581 (n_potential_duplicates=268)
--> Found potential duplicate: 1312: pages2k_3554&4285: ch2k_li94sec01_436
(n_potential_duplicates=269) --> Found potential duplicate: 1312: pages2k_3554&4592: iso2k_1124 (n_potential_duplicates=270) --> Found potential duplicate: 1318: pages2k_3571&4377: iso2k_174 (n_potential_duplicates=271) Progress: 1320/5320 --> Found potential duplicate: 1322: pages2k_3583&3353: fe23_northamerica_usa_co633 (n_potential_duplicates=272) --> Found potential duplicate: 1328: pages2k_3599&4582: iso2k_1069 (n_potential_duplicates=273) --> Found potential duplicate: 1328: pages2k_3599&4701: iso2k_1660 (n_potential_duplicates=274) Progress: 1330/5320 --> Found potential duplicate: 1333: pages2k_3609&1549: fe23_northamerica_canada_cana053 (n_potential_duplicates=275) Progress: 1340/5320 --> Found potential duplicate: 1340: pages2k_3631&4668: iso2k_1530 (n_potential_duplicates=276) --> Found potential duplicate: 1344: pages2k_3642&4106: fe23_northamerica_usa_wy025 (n_potential_duplicates=277) Progress: 1350/5320 Progress: 1360/5320 Progress: 1370/5320 Progress: 1380/5320 Progress: 1390/5320 --> Found potential duplicate: 1391: fe23_southamerica_arge016&1460: fe23_southamerica_arge085 (n_potential_duplicates=278) Progress: 1400/5320 Progress: 1410/5320 Progress: 1420/5320 Progress: 1430/5320 Progress: 1440/5320 Progress: 1450/5320 Progress: 1460/5320 Progress: 1470/5320 Progress: 1480/5320 Progress: 1490/5320 Progress: 1500/5320 Progress: 1510/5320 Progress: 1520/5320 Progress: 1530/5320 Progress: 1540/5320 Progress: 1550/5320 Progress: 1560/5320 Progress: 1570/5320 Progress: 1580/5320 Progress: 1590/5320 --> Found potential duplicate: 1598: fe23_northamerica_canada_cana100&1694: fe23_northamerica_canada_cana213 (n_potential_duplicates=279) Progress: 1600/5320 --> Found potential duplicate: 1603: fe23_northamerica_canada_cana105&1698: fe23_northamerica_canada_cana217 (n_potential_duplicates=280) Progress: 1610/5320 --> Found potential duplicate: 1612: fe23_northamerica_canada_cana116&1649: fe23_northamerica_canada_cana168w (n_potential_duplicates=281) 
Progress: 1620/5320 Progress: 1630/5320 Progress: 1640/5320 --> Found potential duplicate: 1647: fe23_northamerica_canada_cana161&1648: fe23_northamerica_canada_cana162 (n_potential_duplicates=282) Progress: 1650/5320 Progress: 1660/5320 Progress: 1670/5320 Progress: 1680/5320 Progress: 1690/5320 Progress: 1700/5320 Progress: 1710/5320 Progress: 1720/5320 Progress: 1730/5320 Progress: 1740/5320 Progress: 1750/5320 Progress: 1760/5320 Progress: 1770/5320 Progress: 1780/5320 Progress: 1790/5320 --> Found potential duplicate: 1795: fe23_southamerica_chil016&1796: fe23_southamerica_chil017 (n_potential_duplicates=283) Progress: 1800/5320 Progress: 1810/5320 Progress: 1820/5320 Progress: 1830/5320 Progress: 1840/5320 Progress: 1850/5320 Progress: 1860/5320 Progress: 1870/5320 Progress: 1880/5320 Progress: 1890/5320 Progress: 1900/5320 Progress: 1910/5320 Progress: 1920/5320 Progress: 1930/5320 Progress: 1940/5320 Progress: 1950/5320 Progress: 1960/5320 Progress: 1970/5320 Progress: 1980/5320 Progress: 1990/5320 Progress: 2000/5320 Progress: 2010/5320 Progress: 2020/5320 Progress: 2030/5320 Progress: 2040/5320 Progress: 2050/5320 Progress: 2060/5320 Progress: 2070/5320 Progress: 2080/5320 Progress: 2090/5320 Progress: 2100/5320 Progress: 2110/5320 Progress: 2120/5320 Progress: 2130/5320 Progress: 2140/5320 Progress: 2150/5320 Progress: 2160/5320 Progress: 2170/5320 Progress: 2180/5320 Progress: 2190/5320 Progress: 2200/5320 --> Found potential duplicate: 2208: fe23_europe_swed019w&2210: fe23_europe_swed021w (n_potential_duplicates=284) Progress: 2210/5320 Progress: 2220/5320 Progress: 2230/5320 Progress: 2240/5320 Progress: 2250/5320 Progress: 2260/5320 Progress: 2270/5320 Progress: 2280/5320 Progress: 2290/5320 Progress: 2300/5320 Progress: 2310/5320 Progress: 2320/5320 Progress: 2330/5320 Progress: 2340/5320 Progress: 2350/5320 Progress: 2360/5320 Progress: 2370/5320 Progress: 2380/5320 --> Found potential duplicate: 2388: fe23_northamerica_mexico_mexi022&2389: 
fe23_northamerica_mexico_mexi023 (n_potential_duplicates=285) Progress: 2390/5320 Progress: 2400/5320 Progress: 2410/5320 Progress: 2420/5320 Progress: 2430/5320 Progress: 2440/5320 Progress: 2450/5320 Progress: 2460/5320 --> Found potential duplicate: 2469: fe23_australia_newz003&2522: fe23_australia_newz060 (n_potential_duplicates=286) Progress: 2470/5320 --> Found potential duplicate: 2473: fe23_australia_newz008&2554: fe23_australia_newz092 (n_potential_duplicates=287) --> Found potential duplicate: 2477: fe23_australia_newz014&2523: fe23_australia_newz061 (n_potential_duplicates=288) Progress: 2480/5320 --> Found potential duplicate: 2481: fe23_australia_newz018&2524: fe23_australia_newz062 (n_potential_duplicates=289) --> Found potential duplicate: 2482: fe23_australia_newz019&2525: fe23_australia_newz063 (n_potential_duplicates=290) Progress: 2490/5320 Progress: 2500/5320 Progress: 2510/5320 Progress: 2520/5320 Progress: 2530/5320 Progress: 2540/5320 Progress: 2550/5320 Progress: 2560/5320 Progress: 2570/5320 Progress: 2580/5320 Progress: 2590/5320 Progress: 2600/5320 Progress: 2610/5320 Progress: 2620/5320 Progress: 2630/5320 Progress: 2640/5320 Progress: 2650/5320 Progress: 2660/5320 Progress: 2670/5320 Progress: 2680/5320 Progress: 2690/5320 Progress: 2700/5320 Progress: 2710/5320 Progress: 2720/5320 Progress: 2730/5320 Progress: 2740/5320 Progress: 2750/5320 Progress: 2760/5320 Progress: 2770/5320 Progress: 2780/5320 Progress: 2790/5320 Progress: 2800/5320 Progress: 2810/5320 Progress: 2820/5320 Progress: 2830/5320 Progress: 2840/5320 Progress: 2850/5320 Progress: 2860/5320 Progress: 2870/5320 Progress: 2880/5320 Progress: 2890/5320 Progress: 2900/5320 Progress: 2910/5320 Progress: 2920/5320 Progress: 2930/5320 Progress: 2940/5320 Progress: 2950/5320 Progress: 2960/5320 Progress: 2970/5320 Progress: 2980/5320 Progress: 2990/5320 Progress: 3000/5320 Progress: 3010/5320 Progress: 3020/5320 Progress: 3030/5320 Progress: 3040/5320 --> Found potential 
duplicate: 3048: fe23_northamerica_usa_ca066&3176: fe23_northamerica_usa_ca628 (n_potential_duplicates=291) --> Found potential duplicate: 3049: fe23_northamerica_usa_ca067&3176: fe23_northamerica_usa_ca628 (n_potential_duplicates=292) Progress: 3050/5320 Progress: 3060/5320 --> Found potential duplicate: 3067: fe23_northamerica_usa_ca512&3161: fe23_northamerica_usa_ca613 (n_potential_duplicates=293) Progress: 3070/5320 Progress: 3080/5320 --> Found potential duplicate: 3084: fe23_northamerica_usa_ca535&3216: fe23_northamerica_usa_ca670 (n_potential_duplicates=294) Progress: 3090/5320 Progress: 3100/5320 Progress: 3110/5320 Progress: 3120/5320 Progress: 3130/5320 Progress: 3140/5320 Progress: 3150/5320 Progress: 3160/5320 Progress: 3170/5320 Progress: 3180/5320 Progress: 3190/5320 Progress: 3200/5320 Progress: 3210/5320 Progress: 3220/5320 Progress: 3230/5320 Progress: 3240/5320 Progress: 3250/5320 Progress: 3260/5320 Progress: 3270/5320 Progress: 3280/5320 Progress: 3290/5320 Progress: 3300/5320 Progress: 3310/5320 Progress: 3320/5320 Progress: 3330/5320 Progress: 3340/5320 Progress: 3350/5320 Progress: 3360/5320 Progress: 3370/5320 Progress: 3380/5320 Progress: 3390/5320 Progress: 3400/5320 Progress: 3410/5320 Progress: 3420/5320 Progress: 3430/5320 Progress: 3440/5320 --> Found potential duplicate: 3444: fe23_northamerica_usa_me017&3445: fe23_northamerica_usa_me018 (n_potential_duplicates=295) Progress: 3450/5320 Progress: 3460/5320 Progress: 3470/5320 Progress: 3480/5320 Progress: 3490/5320 --> Found potential duplicate: 3499: fe23_northamerica_usa_mo&3508: fe23_northamerica_usa_mo009 (n_potential_duplicates=296) Progress: 3500/5320 Progress: 3510/5320 Progress: 3520/5320 Progress: 3530/5320 Progress: 3540/5320 --> Found potential duplicate: 3547: fe23_northamerica_usa_mt112&3548: fe23_northamerica_usa_mt113 (n_potential_duplicates=297) Progress: 3550/5320 Progress: 3560/5320 Progress: 3570/5320 Progress: 3580/5320 --> Found potential duplicate: 3588: 
fe23_northamerica_usa_nj001&3589: fe23_northamerica_usa_nj002 (n_potential_duplicates=298) Progress: 3590/5320 Progress: 3600/5320 --> Found potential duplicate: 3602: fe23_northamerica_usa_nm024&3628: fe23_northamerica_usa_nm055 (n_potential_duplicates=299) Progress: 3610/5320 Progress: 3620/5320 Progress: 3630/5320 Progress: 3640/5320 Progress: 3650/5320 Progress: 3660/5320 Progress: 3670/5320 Progress: 3680/5320 --> Found potential duplicate: 3687: fe23_northamerica_usa_nv060&3705: fe23_northamerica_usa_nv518 (n_potential_duplicates=300) Progress: 3690/5320 --> Found potential duplicate: 3699: fe23_northamerica_usa_nv512&3708: fe23_northamerica_usa_nv521 (n_potential_duplicates=301) Progress: 3700/5320 --> Found potential duplicate: 3700: fe23_northamerica_usa_nv513&3707: fe23_northamerica_usa_nv520 (n_potential_duplicates=302) Progress: 3710/5320 Progress: 3720/5320 Progress: 3730/5320 Progress: 3740/5320 Progress: 3750/5320 Progress: 3760/5320 Progress: 3770/5320 Progress: 3780/5320 Progress: 3790/5320 Progress: 3800/5320 Progress: 3810/5320 Progress: 3820/5320 Progress: 3830/5320 Progress: 3840/5320 Progress: 3850/5320 Progress: 3860/5320 Progress: 3870/5320 Progress: 3880/5320 Progress: 3890/5320 Progress: 3900/5320 Progress: 3910/5320 Progress: 3920/5320 Progress: 3930/5320 Progress: 3940/5320 Progress: 3950/5320 Progress: 3960/5320 Progress: 3970/5320 Progress: 3980/5320 Progress: 3990/5320 Progress: 4000/5320 Progress: 4010/5320 Progress: 4020/5320 Progress: 4030/5320 Progress: 4040/5320 Progress: 4050/5320 Progress: 4060/5320 Progress: 4070/5320 Progress: 4080/5320 Progress: 4090/5320 Progress: 4100/5320 Progress: 4110/5320 --> Found potential duplicate: 4119: ch2k_zi15mer01_2&4120: ch2k_zi15mer01_4 (n_potential_duplicates=303) Progress: 4120/5320 --> Found potential duplicate: 4121: ch2k_co03pal03_6&4459: iso2k_511 (n_potential_duplicates=304) --> Found potential duplicate: 4122: ch2k_co03pal02_8&4458: iso2k_509 (n_potential_duplicates=305) --> Found 
potential duplicate: 4123: ch2k_li06rar01_12&4662: iso2k_1502 (n_potential_duplicates=306) --> Found potential duplicate: 4124: ch2k_co03pal07_14&4464: iso2k_521 (n_potential_duplicates=307) --> Found potential duplicate: 4126: ch2k_ur00mai01_22&4362: iso2k_94 (n_potential_duplicates=308) --> Found potential duplicate: 4126: ch2k_ur00mai01_22&4363: iso2k_98 (n_potential_duplicates=309) --> Found potential duplicate: 4127: ch2k_tu95mad01_24&4342: iso2k_20 (n_potential_duplicates=310) --> Found potential duplicate: 4128: ch2k_zi04ifr01_26&4395: iso2k_257 (n_potential_duplicates=311) --> Found potential duplicate: 4129: ch2k_re18cay01_30&4555: iso2k_917 (n_potential_duplicates=312) Progress: 4130/5320 --> Found potential duplicate: 4133: ch2k_ku99hou01_40&4518: iso2k_786 (n_potential_duplicates=313) --> Found potential duplicate: 4133: ch2k_ku99hou01_40&4519: iso2k_788 (n_potential_duplicates=314) --> Found potential duplicate: 4138: ch2k_nu11pal01_52&4456: iso2k_505 (n_potential_duplicates=315) --> Found potential duplicate: 4138: ch2k_nu11pal01_52&4482: iso2k_579 (n_potential_duplicates=316) Progress: 4140/5320 --> Found potential duplicate: 4141: ch2k_ca14tim01_64&4445: iso2k_473 (n_potential_duplicates=317) --> Found potential duplicate: 4146: ch2k_he08lra01_76&4736: iso2k_1813 (n_potential_duplicates=318) --> Found potential duplicate: 4147: ch2k_da06maf01_78&4719: iso2k_1748 (n_potential_duplicates=319) --> Found potential duplicate: 4148: ch2k_na09mal01_84&4722: iso2k_1754 (n_potential_duplicates=320) --> Found potential duplicate: 4149: ch2k_sw98stp01_86&4349: iso2k_50 (n_potential_duplicates=321) Progress: 4150/5320 --> Found potential duplicate: 4153: ch2k_da06maf02_104&4719: iso2k_1748 (n_potential_duplicates=322) --> Found potential duplicate: 4156: ch2k_co03pal01_110&4457: iso2k_507 (n_potential_duplicates=323) --> Found potential duplicate: 4159: ch2k_ch98pir01_116&4611: iso2k_1229 (n_potential_duplicates=324) Progress: 4160/5320 --> Found potential 
duplicate: 4164: ch2k_xi17hai01_128&4167: ch2k_xi17hai01_136 (n_potential_duplicates=325) --> Found potential duplicate: 4164: ch2k_xi17hai01_128&4724: iso2k_1762 (n_potential_duplicates=326) --> Found potential duplicate: 4165: ch2k_xi17hai01_130&4166: ch2k_xi17hai01_134 (n_potential_duplicates=327) --> Found potential duplicate: 4167: ch2k_xi17hai01_136&4724: iso2k_1762 (n_potential_duplicates=328) --> Found potential duplicate: 4168: ch2k_de14dto03_140&4172: ch2k_de14dto01_148 (n_potential_duplicates=329) Progress: 4170/5320 --> Found potential duplicate: 4170: ch2k_qu06rab01_144&4626: iso2k_1311 (n_potential_duplicates=330) --> Found potential duplicate: 4173: ch2k_ku00nin01_150&4672: iso2k_1554 (n_potential_duplicates=331) --> Found potential duplicate: 4173: ch2k_ku00nin01_150&4673: iso2k_1556 (n_potential_duplicates=332) Progress: 4180/5320 --> Found potential duplicate: 4187: ch2k_ev18roc01_184&4188: ch2k_ev18roc01_186 (n_potential_duplicates=333) --> Found potential duplicate: 4189: ch2k_ca13sap01_188&4478: iso2k_569 (n_potential_duplicates=334) Progress: 4190/5320 --> Found potential duplicate: 4191: ch2k_he13mis01_194&4383: iso2k_211 (n_potential_duplicates=335) --> Found potential duplicate: 4191: ch2k_he13mis01_194&4384: iso2k_213 (n_potential_duplicates=336) --> Found potential duplicate: 4193: ch2k_zi15imp02_200&4194: ch2k_zi15imp02_202 (n_potential_duplicates=337) --> Found potential duplicate: 4195: ch2k_pf04pba01_204&4705: iso2k_1701 (n_potential_duplicates=338) --> Found potential duplicate: 4195: ch2k_pf04pba01_204&4706: iso2k_1704 (n_potential_duplicates=339) --> Found potential duplicate: 4199: ch2k_co03pal05_212&4461: iso2k_515 (n_potential_duplicates=340) Progress: 4200/5320 --> Found potential duplicate: 4204: ch2k_mo06ped01_226&4490: iso2k_629 (n_potential_duplicates=341) --> Found potential duplicate: 4208: ch2k_os14ucp01_236&4422: iso2k_350 (n_potential_duplicates=342) Progress: 4210/5320 --> Found potential duplicate: 4212: 
ch2k_he10gua01_244&4715: iso2k_1735 (n_potential_duplicates=343) --> Found potential duplicate: 4219: ch2k_dr99abr01_264&4220: ch2k_dr99abr01_266 (n_potential_duplicates=344) --> Found potential duplicate: 4219: ch2k_dr99abr01_264&4361: iso2k_91 (n_potential_duplicates=345) Progress: 4220/5320 --> Found potential duplicate: 4220: ch2k_dr99abr01_266&4361: iso2k_91 (n_potential_duplicates=346) --> Found potential duplicate: 4221: ch2k_li06rar02_270&4661: iso2k_1500 (n_potential_duplicates=347) --> Found potential duplicate: 4225: ch2k_zi15tan01_278&4226: ch2k_zi15tan01_280 (n_potential_duplicates=348) Progress: 4230/5320 --> Found potential duplicate: 4235: ch2k_as05gua01_302&4675: iso2k_1559 (n_potential_duplicates=349) --> Found potential duplicate: 4236: ch2k_fe09oga01_304&4769: iso2k_1922 (n_potential_duplicates=350) Progress: 4240/5320 --> Found potential duplicate: 4240: ch2k_gu99nau01_314&4501: iso2k_702 (n_potential_duplicates=351) --> Found potential duplicate: 4240: ch2k_gu99nau01_314&4502: iso2k_705 (n_potential_duplicates=352) --> Found potential duplicate: 4243: ch2k_co03pal10_324&4463: iso2k_519 (n_potential_duplicates=353) --> Found potential duplicate: 4245: ch2k_zi15imp01_328&4246: ch2k_zi15imp01_330 (n_potential_duplicates=354) --> Found potential duplicate: 4249: ch2k_ro19yuc01_338&4250: ch2k_ro19yuc01_340 (n_potential_duplicates=355) Progress: 4250/5320 --> Found potential duplicate: 4257: ch2k_co03pal09_358&4466: iso2k_525 (n_potential_duplicates=356) Progress: 4260/5320 --> Found potential duplicate: 4260: ch2k_ki04mcv01_366&4376: iso2k_155 (n_potential_duplicates=357) --> Found potential duplicate: 4264: ch2k_ba04fij02_382&4350: iso2k_52 (n_potential_duplicates=358) --> Found potential duplicate: 4265: ch2k_co03pal06_386&4462: iso2k_517 (n_potential_duplicates=359) --> Found potential duplicate: 4269: ch2k_go12sbv01_396&4538: iso2k_870 (n_potential_duplicates=360) Progress: 4270/5320 --> Found potential duplicate: 4271: ch2k_ca07fli01_400&4579: 
iso2k_1057 (n_potential_duplicates=361) --> Found potential duplicate: 4274: ch2k_co93tar01_408&4469: iso2k_539 (n_potential_duplicates=362) --> Found potential duplicate: 4276: ch2k_co00mal01_412&4570: iso2k_1010 (n_potential_duplicates=363) Progress: 4280/5320 --> Found potential duplicate: 4280: ch2k_qu96esv01_422&4386: iso2k_218 (n_potential_duplicates=364) --> Found potential duplicate: 4281: ch2k_de13hai01_424&4284: ch2k_de13hai01_432 (n_potential_duplicates=365) --> Found potential duplicate: 4281: ch2k_de13hai01_424&4696: iso2k_1643 (n_potential_duplicates=366) --> Found potential duplicate: 4282: ch2k_de13hai01_426&4283: ch2k_de13hai01_430 (n_potential_duplicates=367) --> Found potential duplicate: 4284: ch2k_de13hai01_432&4696: iso2k_1643 (n_potential_duplicates=368) --> Found potential duplicate: 4285: ch2k_li94sec01_436&4592: iso2k_1124 (n_potential_duplicates=369) --> Found potential duplicate: 4286: ch2k_zi15cle01_438&4287: ch2k_zi15cle01_440 (n_potential_duplicates=370) Progress: 4290/5320 --> Found potential duplicate: 4290: ch2k_tu01dep01_450&4602: iso2k_1201 (n_potential_duplicates=371) --> Found potential duplicate: 4291: ch2k_co03pal04_452&4460: iso2k_513 (n_potential_duplicates=372) --> Found potential duplicate: 4294: ch2k_fl18dto01_460&4324: ch2k_fl18dto02_554 (n_potential_duplicates=373) --> Found potential duplicate: 4297: ch2k_du94urv01_468&4298: ch2k_du94urv01_470 (n_potential_duplicates=374) --> Found potential duplicate: 4299: ch2k_co03pal08_472&4465: iso2k_523 (n_potential_duplicates=375) Progress: 4300/5320 --> Found potential duplicate: 4301: ch2k_zi14tur01_480&4302: ch2k_zi14tur01_482 (n_potential_duplicates=376) --> Found potential duplicate: 4301: ch2k_zi14tur01_480&4412: iso2k_302 (n_potential_duplicates=377) --> Found potential duplicate: 4302: ch2k_zi14tur01_482&4412: iso2k_302 (n_potential_duplicates=378) --> Found potential duplicate: 4303: ch2k_li99cli01_486&4679: iso2k_1571 (n_potential_duplicates=379) --> Found potential 
duplicate: 4304: ch2k_zi15bun01_488&4305: ch2k_zi15bun01_490 (n_potential_duplicates=380) --> Found potential duplicate: 4306: ch2k_fe18rus01_492&4753: iso2k_1861 (n_potential_duplicates=381) Progress: 4310/5320 --> Found potential duplicate: 4310: ch2k_wu13ton01_504&4311: ch2k_wu13ton01_506 (n_potential_duplicates=382) --> Found potential duplicate: 4312: ch2k_ki14par01_510&4315: ch2k_ki14par01_518 (n_potential_duplicates=383) --> Found potential duplicate: 4313: ch2k_ki14par01_512&4314: ch2k_ki14par01_516 (n_potential_duplicates=384) --> Found potential duplicate: 4316: ch2k_zi14ifr02_522&4317: ch2k_zi14ifr02_524 (n_potential_duplicates=385) Progress: 4320/5320 --> Found potential duplicate: 4325: ch2k_ba04fij01_558&4351: iso2k_55 (n_potential_duplicates=386) --> Found potential duplicate: 4328: ch2k_li06fij01_582&4423: iso2k_353 (n_potential_duplicates=387) Progress: 4330/5320 Progress: 4340/5320 Progress: 4350/5320 --> Found potential duplicate: 4352: iso2k_58&4581: iso2k_1068 (n_potential_duplicates=388) Progress: 4360/5320 --> Found potential duplicate: 4362: iso2k_94&4363: iso2k_98 (n_potential_duplicates=389) Progress: 4370/5320 --> Found potential duplicate: 4370: iso2k_120&4945: sisal_253.0_171 (n_potential_duplicates=390) --> Found potential duplicate: 4375: iso2k_140&4958: sisal_278.0_184 (n_potential_duplicates=391) Progress: 4380/5320 --> Found potential duplicate: 4388: iso2k_236&4915: sisal_205.0_141 (n_potential_duplicates=392) Progress: 4390/5320 Progress: 4400/5320 --> Found potential duplicate: 4408: iso2k_296&4409: iso2k_298 (n_potential_duplicates=393) --> Found potential duplicate: 4408: iso2k_296&4410: iso2k_299 (n_potential_duplicates=394) --> Found potential duplicate: 4409: iso2k_298&4410: iso2k_299 (n_potential_duplicates=395) Progress: 4410/5320 Progress: 4420/5320 --> Found potential duplicate: 4428: iso2k_380&5066: sisal_446.0_292 (n_potential_duplicates=396) Progress: 4430/5320 --> Found potential duplicate: 4430: iso2k_399&4521: 
iso2k_806 (n_potential_duplicates=397) --> Found potential duplicate: 4430: iso2k_399&4522: iso2k_811 (n_potential_duplicates=398)
/home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide c /= stddev[:, None] /home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide c /= stddev[None, :]
Progress: 4440/5320 Progress: 4450/5320 --> Found potential duplicate: 4456: iso2k_505&4482: iso2k_579 (n_potential_duplicates=399) Progress: 4460/5320 --> Found potential duplicate: 4468: iso2k_533&4843: sisal_115.0_69 (n_potential_duplicates=400) Progress: 4470/5320 --> Found potential duplicate: 4471: iso2k_546&4473: iso2k_549 (n_potential_duplicates=401) --> Found potential duplicate: 4472: iso2k_547&4474: iso2k_550 (n_potential_duplicates=402) Progress: 4480/5320 Progress: 4490/5320 Progress: 4500/5320 --> Found potential duplicate: 4501: iso2k_702&4502: iso2k_705 (n_potential_duplicates=403) Progress: 4510/5320 --> Found potential duplicate: 4514: iso2k_772&4515: iso2k_775 (n_potential_duplicates=404) --> Found potential duplicate: 4518: iso2k_786&4519: iso2k_788 (n_potential_duplicates=405) Progress: 4520/5320 --> Found potential duplicate: 4521: iso2k_806&4522: iso2k_811 (n_potential_duplicates=406) Progress: 4530/5320 --> Found potential duplicate: 4539: iso2k_873&5088: sisal_471.0_314 (n_potential_duplicates=407) Progress: 4540/5320 Progress: 4550/5320 Progress: 4560/5320 Progress: 4570/5320 Progress: 4580/5320 --> Found potential duplicate: 4582: iso2k_1069&4701: iso2k_1660 (n_potential_duplicates=408) --> Found potential duplicate: 4588: iso2k_1107&4737: iso2k_1817 (n_potential_duplicates=409) --> Found potential duplicate: 4588: iso2k_1107&4948: sisal_271.0_174 (n_potential_duplicates=410) Progress: 4590/5320 --> Found potential duplicate: 4599: iso2k_1178&4907: sisal_201.0_133 (n_potential_duplicates=411) Progress: 4600/5320 Progress: 4610/5320 --> Found potential duplicate: 4617: iso2k_1283&4618: iso2k_1286 (n_potential_duplicates=412) Progress: 4620/5320 --> Found potential duplicate: 4620: iso2k_1288&4987: sisal_329.0_213 (n_potential_duplicates=413) --> Found potential duplicate: 4621: iso2k_1291&4989: sisal_330.0_215 (n_potential_duplicates=414) Progress: 4630/5320 Progress: 4640/5320 Progress: 4650/5320 --> Found potential duplicate: 4659: 
iso2k_1495&4973: sisal_305.0_199 (n_potential_duplicates=415) Progress: 4660/5320 --> Found potential duplicate: 4663: iso2k_1504&4840: sisal_113.0_66 (n_potential_duplicates=416) Progress: 4670/5320 --> Found potential duplicate: 4672: iso2k_1554&4673: iso2k_1556 (n_potential_duplicates=417) Progress: 4680/5320 Progress: 4690/5320 Progress: 4700/5320 --> Found potential duplicate: 4705: iso2k_1701&4706: iso2k_1704 (n_potential_duplicates=418) Progress: 4710/5320 Progress: 4720/5320 Progress: 4730/5320 --> Found potential duplicate: 4737: iso2k_1817&4948: sisal_271.0_174 (n_potential_duplicates=419) --> Found potential duplicate: 4738: iso2k_1820&4951: sisal_272.0_177 (n_potential_duplicates=420) --> Found potential duplicate: 4739: iso2k_1823&4953: sisal_273.0_179 (n_potential_duplicates=421) Progress: 4740/5320 --> Found potential duplicate: 4745: iso2k_1848&4751: iso2k_1855 (n_potential_duplicates=422) --> Found potential duplicate: 4746: iso2k_1850&4747: iso2k_1851 (n_potential_duplicates=423) Progress: 4750/5320 --> Found potential duplicate: 4752: iso2k_1856&4968: sisal_294.0_194 (n_potential_duplicates=424) Progress: 4760/5320 Progress: 4770/5320 Progress: 4780/5320 Progress: 4790/5320 --> Found potential duplicate: 4792: sisal_46.0_18&4795: sisal_47.0_21 (n_potential_duplicates=425) --> Found potential duplicate: 4793: sisal_46.0_19&4796: sisal_47.0_22 (n_potential_duplicates=426) --> Found potential duplicate: 4794: sisal_46.0_20&4797: sisal_47.0_23 (n_potential_duplicates=427) Progress: 4800/5320 Progress: 4810/5320 Progress: 4820/5320 Progress: 4830/5320 Progress: 4840/5320 Progress: 4850/5320 Progress: 4860/5320 Progress: 4870/5320 Progress: 4880/5320 Progress: 4890/5320 Progress: 4900/5320 Progress: 4910/5320 Progress: 4920/5320 Progress: 4930/5320 Progress: 4940/5320 Progress: 4950/5320 Progress: 4960/5320 Progress: 4970/5320 Progress: 4980/5320 Progress: 4990/5320 Progress: 5000/5320 Progress: 5010/5320 Progress: 5020/5320 Progress: 5030/5320 
Progress: 5040/5320 --> Found potential duplicate: 5044: sisal_430.0_270&5305: sisal_896.0_531 (n_potential_duplicates=428) --> Found potential duplicate: 5045: sisal_430.0_271&5307: sisal_896.0_533 (n_potential_duplicates=429) Progress: 5050/5320 Progress: 5060/5320 Progress: 5070/5320 Progress: 5080/5320 Progress: 5090/5320 Progress: 5100/5320 Progress: 5110/5320 Progress: 5120/5320 Progress: 5130/5320 Progress: 5140/5320 Progress: 5150/5320 Progress: 5160/5320 Progress: 5170/5320 Progress: 5180/5320 Progress: 5190/5320 Progress: 5200/5320 Progress: 5210/5320 Progress: 5220/5320 Progress: 5230/5320 Progress: 5240/5320 Progress: 5250/5320 Progress: 5260/5320 Progress: 5270/5320 Progress: 5280/5320 Progress: 5290/5320 Progress: 5300/5320 Progress: 5310/5320 ============================================================ Saved indices, IDs, distances, correlations in data/all_merged/dup_detection/ ============================================================ Detected 429 possible duplicates in all_merged. ============================================================
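The `RuntimeWarning: invalid value encountered in divide` in the log above appears to come from `numpy.corrcoef`. It typically occurs when one of the two series is constant over the compared period, so the normalisation divides zero by zero and the correlation is NaN. The following is a minimal sketch of a guard, assuming the pairwise correlation is computed with `numpy.corrcoef`. The helper name `safe_corrcoef` is hypothetical and is not part of `f_duplicate_search.py`.

```python
import numpy as np

def safe_corrcoef(x, y):
    """Pearson correlation of two equal-length series; NaN where it is undefined.

    Hypothetical helper for illustration only, not the notebook's own code.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # keep only the time steps where both series hold finite values
    mask = np.isfinite(x) & np.isfinite(y)
    x, y = x[mask], y[mask]
    # the correlation is undefined for fewer than two points or a constant series
    if x.size < 2 or x.max() == x.min() or y.max() == y.min():
        return np.nan
    return float(np.corrcoef(x, y)[0, 1])

print(safe_corrcoef([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # well-defined correlation
print(safe_corrcoef([1.0, 1.0, 1.0], [1.0, 2.0, 3.0]))  # constant series, returns NaN
```

Returning NaN explicitly keeps the pair out of any threshold comparison (`nan > threshold` is always False) without emitting the warning.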
Plot duplicate candidate pairs¶
OPTIONAL: plot the candidate pairs flagged by the duplicate detection algorithm.
The function plot_duplicates loads the flagged candidate pairs for a database named DB from the CSV file data/DB/dup_detection/dup_detection_candidates_DB.csv and produces a summary figure for each potential duplicate pair. The figures are saved in the directory figs/DB/dup_detection/.
Note that the same summary figures are used in the duplicate decision process (dup_decisions.ipynb).
dup.plot_duplicates(df, save_figures=True, display=False)
[autoreload of dod2k_utilities.ut_functions failed: Traceback (most recent call last):
File "/home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/site-packages/IPython/extensions/autoreload.py", line 276, in check
superreload(m, reload, self.old_objects)
File "/home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/site-packages/IPython/extensions/autoreload.py", line 475, in superreload
module = reload(module)
^^^^^^^^^^^^^^
File "/home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/importlib/__init__.py", line 169, in reload
_bootstrap._exec(spec, module)
File "<frozen importlib._bootstrap>", line 621, in _exec
File "<frozen importlib._bootstrap_external>", line 936, in exec_module
File "<frozen importlib._bootstrap_external>", line 1074, in get_code
File "<frozen importlib._bootstrap_external>", line 1004, in source_to_code
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/home/jupyter-lluecke/dod2k/dod2k_utilities/ut_functions.py", line 493
format='png', dpi=300, bbox_inches='tight', pad_inches=0.0)
^
SyntaxError: unmatched ')'
]
> 0/429,pages2k_0,iso2k_296,0.0,0.9999999995947852
saved figure in /home/jupyter-lluecke/dod2k/figs/all_merged/dup_detection/000_pages2k_0_iso2k_296__0_4408.pdf === POTENTIAL DUPLICATE 0/429: pages2k_0+iso2k_296 === > 1/429,pages2k_0,iso2k_298,0.0,0.9999999995947852
saved figure in /home/jupyter-lluecke/dod2k/figs/all_merged/dup_detection/001_pages2k_0_iso2k_298__0_4409.pdf === POTENTIAL DUPLICATE 1/429: pages2k_0+iso2k_298 === > 2/429,pages2k_0,iso2k_299,0.0,0.9999999995947852
saved figure in /home/jupyter-lluecke/dod2k/figs/all_merged/dup_detection/002_pages2k_0_iso2k_299__0_4410.pdf === POTENTIAL DUPLICATE 2/429: pages2k_0+iso2k_299 === > 3/429,pages2k_6,FE23_northamerica_usa_az555,5.775408685862238,0.978353859816631
saved figure in /home/jupyter-lluecke/dod2k/figs/all_merged/dup_detection/003_pages2k_6_FE23_northamerica_usa_az555__2_3037.pdf === POTENTIAL DUPLICATE 3/429: pages2k_6+FE23_northamerica_usa_az555 === > 4/429,pages2k_50,FE23_northamerica_canada_cana091,3.197082790629511,0.9674400553180403
saved figure in /home/jupyter-lluecke/dod2k/figs/all_merged/dup_detection/004_pages2k_50_FE23_northamerica_canada_cana091__17_1587.pdf === POTENTIAL DUPLICATE 4/429: pages2k_50+FE23_northamerica_canada_cana091 === > 5/429,pages2k_62,pages2k_63,0.0,0.9442037258051723
saved figure in /home/jupyter-lluecke/dod2k/figs/all_merged/dup_detection/005_pages2k_62_pages2k_63__20_21.pdf === POTENTIAL DUPLICATE 5/429: pages2k_62+pages2k_63 === > 6/429,pages2k_81,ch2k_HE08LRA01_76,0.0,0.9999999922133574
saved figure in /home/jupyter-lluecke/dod2k/figs/all_merged/dup_detection/006_pages2k_81_ch2k_HE08LRA01_76__29_4146.pdf === POTENTIAL DUPLICATE 6/429: pages2k_81+ch2k_HE08LRA01_76 === > 7/429,pages2k_81,iso2k_1813,0.0,0.9999999922133574
saved figure in /home/jupyter-lluecke/dod2k/figs/all_merged/dup_detection/007_pages2k_81_iso2k_1813__29_4736.pdf === POTENTIAL DUPLICATE 7/429: pages2k_81+iso2k_1813 === > 8/429,pages2k_83,iso2k_1916,0.0,0.9999999999999999
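The figure names in the log above appear to follow the pattern `NNN_ID1_ID2__index1_index2.pdf` inside `figs/DB/dup_detection/`. The sketch below rebuilds such a path so that a figure can be located for a given candidate pair. The pattern is inferred from the log only; the helper name `candidate_figure_path`, the three-digit zero padding and the relative `figs` root are assumptions, not the actual code of `plot_duplicates`.

```python
from pathlib import Path

def candidate_figure_path(n, id1, id2, idx1, idx2, db_name, root="figs"):
    """Build the expected figure path for candidate pair number n.

    Hypothetical helper: the naming pattern is inferred from the log output,
    not taken from the plot_duplicates source.
    """
    # pair counter (zero-padded), both record IDs, then both dataframe indices
    fname = f"{n:03d}_{id1}_{id2}__{idx1}_{idx2}.pdf"
    return Path(root) / db_name / "dup_detection" / fname

# first candidate pair reported in the log above
path = candidate_figure_path(0, "pages2k_0", "iso2k_296", 0, 4408, "all_merged")
print(path.as_posix())
```

Checking `path.exists()` for each candidate pair is one way to confirm that every figure was written before moving on to the decision step.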
# Check that the candidate file written by the detection step exists.
fn = utf.find(f'dup_detection_candidates_{df.name}.csv', f'data/{df.name}/dup_detection')
if fn != []:
    # File found: report its location and point to the next notebook.
    print('----------------------------------------------------')
    print('Successfully finished the duplicate detection process!'.upper())
    print('----------------------------------------------------')
    print('Saved the detection output file in:')
    print()
    print('%s.'%', '.join(fn))
    print()
    print('You are now able to proceed to the next notebook: dup_decision.ipynb')
else:
    # File missing: the detection step did not complete.
    print('Final output file is missing.')
    print()
    print('Please re-run the notebook to complete the duplicate detection process.')
---------------------------------------------------- SUCCESSFULLY FINISHED THE DUPLICATE DETECTION PROCESS! ---------------------------------------------------- Saved the detection output file in: data/all_merged/dup_detection/dup_detection_candidates_all_merged.csv. You are now able to proceed to the next notebook: dup_decision.ipynb